Smart Paper Notes

QT-Routenet: Improved GNN generalization to larger 5G networks by fine-tuning predictions from queueing theory

Bruno Klaus de Aquino Afonso , Lilian Berton

Category: Machine Learning

2022-07-13

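To promote the use of machine learning in 5G, the International Telecommunication Union (ITU) proposed in 2021 the second edition of the ITU AI/ML in 5G Challenge, which drew more than 1,600 participants from 82 countries. This work details the second-place solution overall, which is also the winning solution of the Graph Neural Networking Challenge 2021. We tackle the problem of generalization when applying a model to 5G networks that may have longer paths and larger link capacities than those observed in training. To achieve this, we propose to first extract robust features related to queueing theory (QT), and then fine-tune the analytical baseline predictions using a modification of the Routenet graph neural network (GNN) model. The proposed solution generalizes much better than simply using Routenet, and reduces the analytical baseline's mean absolute percentage error from 10.42 to 1.45 (1.27 with an ensemble). This suggests that making small changes to an approximate model that is known to be robust can be an effective way to improve accuracy without compromising generalization.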
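The idea of an analytical queueing baseline scored by mean absolute percentage error can be illustrated with a minimal sketch. The abstract does not say which queueing model or features the authors use, so the M/M/1 formula below is an assumption chosen for simplicity, and every function name is hypothetical.

```python
def mm1_mean_delay(arrival_rate: float, service_rate: float) -> float:
    """Mean time in system for an M/M/1 queue: 1 / (mu - lambda).

    Assumption: each link behaves as an M/M/1 queue. The paper's actual
    queueing model is not given in the abstract.
    """
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable when arrival_rate >= service_rate")
    return 1.0 / (service_rate - arrival_rate)


def path_delay(link_loads, link_capacities) -> float:
    """Analytical baseline for a path: sum of per-link M/M/1 delays."""
    return sum(mm1_mean_delay(load, cap)
               for load, cap in zip(link_loads, link_capacities))


def mape(y_true, y_pred) -> float:
    """Mean absolute percentage error, in percent."""
    errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical two-link path: loads and capacities in packets per second.
baseline = path_delay([2.0, 1.0], [4.0, 3.0])
```

In this framing, a learned model would predict a correction to `baseline` rather than the delay itself, which is one way to read "fine-tuning the analytical baseline predictions".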

Optimizing Diffusion Rate and Label Reliability in a Graph-Based Semi-supervised Classifier

Bruno Klaus de Aquino Afonso , Lilian Berton

Category: Machine Learning

2022-01-10

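Semi-supervised learning has received attention from researchers because it allows the structure of unlabeled data to be exploited to achieve competitive classification results with far fewer labels than supervised approaches. The Local and Global Consistency (LGC) algorithm is one of the best-known graph-based semi-supervised learning (GSSL) classifiers. Notably, its solution can be written as a linear combination of the known labels. The coefficients of this linear combination depend on a parameter $\alpha$, which determines the decay over time when labeled vertices are reached in a random walk. In this work, we discuss how removing the self-influence of a labeled instance may be beneficial, and how this relates to leave-one-out error. Moreover, we propose to minimize this leave-one-out loss with automatic differentiation. Within this framework, we propose methods to estimate label reliability and diffusion rate. Optimizing the diffusion rate is accomplished more efficiently with a spectral representation. Results show that the label reliability approach competes with robust L1-norm methods, and that removing diagonal entries reduces the risk of overfitting and leads to a suitable criterion for parameter selection.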
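The linear-combination view of LGC can be sketched as follows. This uses the standard closed form $F = (I - \alpha S)^{-1} Y$ with $S = D^{-1/2} W D^{-1/2}$, approximated here by a truncated Neumann series so that the example needs no external libraries. Zeroing the diagonal of the propagation matrix is only an illustrative reading of "removing self-influence"; the paper's exact procedure is not given in the abstract, and all function names are hypothetical.

```python
import math


def normalized_affinity(W):
    """S = D^{-1/2} W D^{-1/2}; assumes W is symmetric with no isolated vertex."""
    deg = [sum(row) for row in W]
    n = len(W)
    return [[W[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def propagation_matrix(W, alpha, n_iter=200):
    """Approximate P = (I - alpha*S)^{-1} by the series sum_k (alpha*S)^k."""
    S = normalized_affinity(W)
    n = len(W)
    eye = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    P = [row[:] for row in eye]
    for _ in range(n_iter):
        # Fixed-point update P <- I + alpha * S @ P, convergent for alpha < 1.
        P = [[eye[i][j] + alpha * sum(S[i][k] * P[k][j] for k in range(n))
              for j in range(n)] for i in range(n)]
    return P


def lgc_scores(W, Y, alpha, remove_self_influence=False):
    """Class scores F = P @ Y; each row is a linear combination of known labels."""
    P = propagation_matrix(W, alpha)
    n, n_classes = len(Y), len(Y[0])
    if remove_self_influence:
        # Illustrative assumption: a vertex may not vote for itself.
        for i in range(n):
            P[i][i] = 0.0
    return [[sum(P[i][k] * Y[k][c] for k in range(n)) for c in range(n_classes)]
            for i in range(n)]


def predict(scores):
    """Index of the highest-scoring class for every vertex."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Path graph 0-1-2-3; vertex 0 is labeled class 0 and vertex 3 class 1.
W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
Y = [[1, 0], [0, 0], [0, 0], [0, 1]]
labels = predict(lgc_scores(W, Y, alpha=0.9))
```

With the diagonal zeroed, the score of a labeled vertex comes only from the other labeled vertices, so comparing its prediction with its true label gives a leave-one-out style check.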

MolE: a molecular foundation model for drug discovery

Oscar Méndez-Lucio , Christos Nicolaou , Berton Earnshaw

Category: Machine Learning

2022-11-03

Models that accurately predict properties based on chemical structure are valuable tools in drug discovery. However, for many properties, public and private training sets are typically small, and it is difficult for the models to generalize well outside of the training data. Recently, large language models have addressed this problem by using self-supervised pretraining on large unlabeled datasets, followed by fine-tuning on smaller, labeled datasets. In this paper, we report MolE, a molecular foundation model that adapts the DeBERTa architecture to be used on molecular graphs together with a two-step pretraining strategy. The first step of pretraining is a self-supervised approach focused on learning chemical structures, and the second step is a massive multi-task approach to learn biological information. We show that fine-tuning pretrained MolE achieves state-of-the-art results on 9 of the 22 ADMET tasks included in the Therapeutic Data Commons.

Personalized Dialogue Generation with Persona-Adaptive Attention

Qiushi Huang , Yu Zhang , Tom Ko , Xubo Liu , Bo Wu , Wenwu Wang , Lilian Tang

Category: Natural Language Processing

2022-10-27

Persona-based dialogue systems aim to generate consistent responses based on historical context and a predefined persona. Unlike conventional dialogue generation, persona-based dialogue needs to consider both dialogue context and persona, posing a challenge for coherent training. Specifically, this requires a delicate weight balance between context and persona. To achieve that, in this paper, we propose an effective framework with Persona-Adaptive Attention (PAA), which adaptively integrates the weights from the persona and context information via our designed attention. In addition, a dynamic masking mechanism is applied to the PAA, both to drop redundant information in context and persona and to serve as a regularization mechanism that avoids overfitting. Experimental results demonstrate the superiority of the proposed PAA framework over strong baselines in both automatic and human evaluation. Moreover, the proposed PAA approach performs well in a low-resource regime: with only 20% to 30% of the data, it achieves results similar to those of larger models trained in the full-data setting. To fully exploit the effectiveness of our design, we designed several variants that handle the weighted information in different ways, showing the necessity and sufficiency of our weighting and masking designs.

Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks

Barack Wanjawa , Lilian Wanzare , Florence Indede , Owen McOnyango , Edward Ombui , Lawrence Muchemi

Category: Natural Language Processing

2022-08-25

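Indigenous African languages are categorized as under-served in artificial intelligence and suffer from poor digital inclusivity and information access. The challenge has been how to use machine learning and deep learning models without the requisite data. Kencorpus is a Kenyan language corpus that intends to bridge the gap on how to collect and store text and speech data that is good enough to enable data-driven solutions such as machine translation, question answering, and transcription in multilingual communities. Kencorpus is a corpus (text and speech) for three languages predominantly spoken in Kenya: Swahili, Dholuo, and Luhya (dialects Lumarachi, Lulogooli, and Lubukusu). The corpus intends to fill the gap of developing a dataset that can be used for natural language processing and machine learning tasks for low-resource languages. Each of these languages contributed text and speech data to the corpus. Data collection was done by researchers from communities, schools, and collaborating partners (media, publishers). Kencorpus has a collection of 5,594 items, comprising 4,442 texts (5.6 million words) and 1,152 speech files (177 hours). Based on this data, other datasets were also developed, such as POS tagging sets for Dholuo and Luhya (50,000 and 93,000 words respectively), question-answer pairs from Swahili texts (7,537 QA pairs), and translations of texts into Swahili (12,400 sentences). The datasets are useful for machine learning tasks such as text processing, annotation, and translation. The project also built proof-of-concept systems for speech-to-text and for machine learning on the QA task, with initial results confirming the usability of Kencorpus for the machine learning community. Kencorpus is the first corpus of its kind for these low-resource languages and forms a basis for learning and sharing experiences for similar work.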

A Holistic Approach to Undesired Content Detection in the Real World

Todor Markov , Chong Zhang , Sandhini Agarwal , Tyna Eloundou , Teddy Lee , Steven Adler , Angela Jiang , Lilian Weng

Category: Natural Language Processing | Machine Learning

2022-08-05

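We present a holistic approach to building a robust and useful natural language classification system for real-world content moderation. The success of such a system relies on a chain of carefully designed and executed steps, including the design of content taxonomies and labeling instructions, data quality control, an active learning pipeline to capture rare events, and a variety of methods to make the model robust and to avoid overfitting. Our moderation system is trained to detect a broad set of categories of undesired content, including sexual content, hateful content, violence, self-harm, and harassment. This approach generalizes to a wide range of content taxonomies and can be used to create high-quality content classifiers that outperform off-the-shelf models.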

Developing an NLP-based Recommender System for the Ethical, Legal, and Social Implications of Synthetic Biology

Damien Dablain , Lilian Huang , Brandon Sepulvado

Category: Artificial Intelligence

2022-07-10

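Synthetic biology is an emerging field that involves the engineering and redesign of organisms for purposes such as food security, health, and environmental protection. As such, it poses numerous ethical, legal, and social implications (ELSI) for researchers and policy makers. Various efforts to ensure socially responsible synthetic biology are underway. Policy making is one regulatory avenue, and other initiatives have sought to embed social scientists and ethicists in synthetic biology projects. However, given how new synthetic biology is, the number of heterogeneous domains it spans, and the open nature of many ethical questions, it has proven challenging to establish widespread concrete policies, and including social scientists and ethicists on synthetic biology teams has met with mixed success. This paper proposes a different approach, asking instead whether it is possible to develop a well-performing recommender model based on natural language processing (NLP) to connect synthetic biologists with information on the ELSI of their specific research. This recommender was developed as part of a larger project building a Synthetic Biology Knowledge System (SBKS) to accelerate discovery and exploration of the synthetic biology design space. Our approach aims to distill ethical and social scientific information relevant to synthetic biologists and embed it into synthetic biology research workflows.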

Learning Sequential Descriptors for Sequence-based Visual Place Recognition

Riccardo Mereu , Gabriele Trivigno , Gabriele Berton , Carlo Masone , Barbara Caputo

Category: Computer Vision

2022-07-08

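In robotics, Visual Place Recognition is a continuous process that takes a video stream as input and produces a hypothesis of the robot's current position within a map of known places. This task requires robust, scalable, and efficient techniques for real applications. This work proposes a detailed taxonomy of techniques that use sequential descriptors, highlighting different mechanisms for fusing the information from the individual images. This categorization is supported by a complete benchmark of experimental results, which provides evidence on the strengths and weaknesses of these different architectural choices. In comparison with existing sequential descriptor methods, we further investigate the viability of Transformers instead of CNN backbones, and we propose a new ad hoc sequence-level aggregator called SeqVLAD, which outperforms the prior state of the art on different datasets. The code is available at https://github.com/vandal-vpr/vg-transformers.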

Attack Techniques and Threat Identification for Vulnerabilities

Constantin Adam , Muhammed Fatih Bulut , Daby Sow , Steven Ocepek , Chris Bedell , Lilian Ngweta

Category: Artificial Intelligence

2022-06-22

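Modern organizations struggle with the number of vulnerabilities discovered and reported by their network and application vulnerability scanners. Prioritization and focus therefore become critical, so that limited time is spent on the highest-risk vulnerabilities. To do this, it is important for these organizations not only to understand the technical descriptions of the vulnerabilities, but also to gain insight into the attackers' perspective. In this work, we use machine learning and natural language processing techniques, together with several publicly available datasets, to provide an explainable mapping of vulnerabilities to attack techniques and threat actors. This work provides new security intelligence by predicting which attack techniques are most likely to be used to exploit a given vulnerability and which threat actors are most likely to conduct the exploitation. The lack of labeled data and the differing vocabularies make mapping vulnerabilities to attack techniques at scale a challenging problem that cannot be solved easily with supervised or unsupervised (similarity search) learning techniques. To solve this problem, we first map the vulnerabilities to a standard set of common weaknesses, and then map the common weaknesses to the attack techniques. This approach yields a mean reciprocal rank (MRR) of 0.95, an accuracy comparable to that reported for state-of-the-art systems. Our solution has been deployed to IBM Security X-Force Red Vulnerability Management Services and is running in production. The solution helps security practitioners assist customers in managing and prioritizing their vulnerabilities, ...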
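For readers unfamiliar with the metric, mean reciprocal rank averages, over all queries, the reciprocal of the rank at which the first correct item appears. A minimal sketch follows; the function name and the technique identifiers in the example are hypothetical and are not taken from the paper.

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets) -> float:
    """Average of 1/rank of the first relevant item per query.

    A query whose ranking contains no relevant item contributes 0.
    """
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranking, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


# Hypothetical example: predicted attack techniques for three vulnerabilities.
predictions = [["T1", "T2"], ["T3", "T1"], ["T5"]]
ground_truth = [{"T1"}, {"T1"}, {"T9"}]
score = mean_reciprocal_rank(predictions, ground_truth)  # (1 + 1/2 + 0) / 3
```

An MRR of 0.95 therefore means the first correct technique is, on average, at or very near the top of the predicted list.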

KenSwQuAD -- A Question Answering Dataset for Swahili Low Resource Language

Barack W. Wanjawa , Lilian D. A. Wanzare , Florence Indede , Owen McOnyango , Lawrence Muchemi , Edward Ombui

Category: Natural Language Processing | Machine Learning

2022-05-04

The need for Question Answering datasets in low-resource languages is the motivation for this research, which led to the development of the Kencorpus Swahili Question Answering Dataset, KenSwQuAD. The dataset is annotated from raw story texts in Swahili, a low-resource language that is predominantly spoken in Eastern Africa and in other parts of the world. Question Answering (QA) datasets are important for machine comprehension of natural language in tasks such as internet search and dialog systems. Machine learning systems need training data such as the gold-standard Question Answering set developed in this research. The research engaged annotators to formulate QA pairs from Swahili texts collected by the Kencorpus project, a Kenyan languages corpus. The project annotated 1,445 of the total 2,585 texts with at least 5 QA pairs each, resulting in a final dataset of 7,526 QA pairs. A quality assurance set of 12.5% of the annotated texts confirmed that the QA pairs were all correctly annotated. A proof of concept applying the set to the QA task confirmed that the dataset is usable for such tasks. KenSwQuAD has also contributed to the resourcing of the Swahili language.
